Okkhor: A Synthetic Corpus of Bangla Printed Characters
Mridul Banik, Md Jamiur Rahman Rifat, Jebun Nahar, Nazmul Hasan, Fuad Rahman
Accepted to be presented at FTC 2020 - Future Technologies Conference 2020, 5-6 November 2020, Vancouver, Canada
Description
Bangla is the fifth most-spoken native language in the world.
Despite this large number of speakers, resources for developing
language processing solutions remain very limited. To realize
the full potential of Machine Learning (ML) and Artificial Intelligence (AI)
solutions for computer vision and Natural Language Processing (NLP), a
complete and standardized fully-annotated corpus is an essential prerequisite.
Specifically, developing Optical Character Recognition (OCR) systems
for printed characters, an important resource for language automation and
digitization, requires a large corpus with high coverage and variability of
fonts that represents the nuances of language usage; no such corpus exists
for Bangla. In this paper, we present a novel synthetic corpus of over 5
million printed Bangla characters covering 60 alphanumeric characters, 10
vowel modifiers, and 159 compound characters, which correspond to 229
different classes in both Unicode and ASCII encodings. This is entirely
novel work, since no such corpus currently exists for the Bangla
language.